Prefill batching logic to handle chunked prefill/prefix caching for HPU #753
Conversation
Due to HPU padding constraints, batching prefill requests with existing history (ctx != 0) causes excessive memory usage: the entire batch must be padded to the longest context, which leads to OOM. This patch enforces a batch size of 1 for prefill operations when ctx != 0. Although this sacrifices some throughput in corner cases, it effectively eliminates the OOM risk. Signed-off-by: Tony Lin <tony.lin@intel.com>
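For illustration, here is a minimal sketch of the batching rule described above, not the PR's actual diff. `Request`, `num_computed_tokens`, and `form_prefill_batch` are hypothetical stand-in names, assuming `num_computed_tokens > 0` corresponds to ctx != 0:

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    num_computed_tokens: int  # ctx: tokens already cached (prefix cache / chunked prefill)
    num_prompt_tokens: int

def form_prefill_batch(waiting: list[Request], max_batch_size: int) -> list[Request]:
    """Greedily batch ctx == 0 prefills in FIFO order; run ctx != 0 prefills alone."""
    batch: list[Request] = []
    for req in waiting:
        if req.num_computed_tokens > 0:
            # ctx != 0: padding the whole batch to this request's context
            # length risks OOM on HPU, so schedule it with batch size 1.
            return batch if batch else [req]
        batch.append(req)
        if len(batch) == max_batch_size:
            break
    return batch
```

For example, given waiting requests `[a(ctx=0), b(ctx=512), c(ctx=0)]`, the first call returns `[a]`, and `b` then runs alone in the next scheduling round, so no batch is ever padded out to `b`'s 512-token context.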
✅ CI Passed. All checks passed successfully against the following vllm commit:
xuechendi
left a comment
LGTM. @adobrzyn, could you take a second review?
@hlin99, hmm, on second thought I am a little unsure how these changes impact unified attention. Will need @kzawora-intel to check.
Sure. I wasn't aware UA goes down the same code path. If UA can overcome the HPU padding constraint, we can definitely split the code into UA and non-UA paths (see the sketch below). @kzawora-intel, please advise. Thanks.
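If unified attention does turn out to tolerate mixed context lengths, the split discussed above might look roughly like this hypothetical sketch, reusing the `form_prefill_batch` helper from the earlier example; the `unified_attention` flag is an assumption, not an existing vLLM option:

```python
def select_prefill_batch(waiting: list[Request], max_batch_size: int,
                         unified_attention: bool = False) -> list[Request]:
    if unified_attention:
        # Assumes UA handles heterogeneous context lengths without the
        # padding blow-up, so batching normally up to the limit is safe.
        return waiting[:max_batch_size]
    # Otherwise apply this patch's rule: ctx != 0 prefills run with batch size 1.
    return form_prefill_batch(waiting, max_batch_size)
```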